Exercise 1: Melbourne housing
- Read in a copy of the Melbourne housing data from Nick Tierney’s GitHub repo, which is a collation of the version at Kaggle. It’s fairly large, so let’s start simply and choose two suburbs to focus on. I recommend “South Yarra” and “Brighton”. (Note: there are a number of missing values. I recommend removing these before making plots.)
- Make a scatterplot matrix of price, rooms, bedroom2, bathroom, suburb, type. The plot will be easier to read if you put the numerical variables first, and then the categorical variables. What are the associations that can be seen?
- There is a positive linear association between price, rooms, bedroom2 and bathroom, which indicates that the bigger the house, the higher the price.
- From the boxplots: houses in Brighton tend to be higher priced and bigger than those in South Yarra, and houses tend to be worth more than apartments or units.
- From the fluctuation diagram, Brighton tends to have more houses, and South Yarra has more apartments.
- From the density plot, price has a skewed distribution.
- There is one big outlier: a single property that sold for a much higher price than the rest.
- Subset the data to South Yarra only. Make an interactive scatterplot matrix of rooms, bedroom2, bathroom and price, coloured by type of property. There is one really high-priced property. Select this case, and determine what’s special about it – why did it sell for so much? Select the outlier in bedrooms and bathrooms, and examine the other characteristics of this property.
The high-priced property has relatively modest characteristics! The property with 5 bathrooms for 3 bedrooms is fairly low priced. Maybe there is a mistake in the data and the bedroom and bathroom counts were swapped.
- Examine price vs rooms coloured by bathrooms, faceted by suburb and type, and with a linear model overlaid. What do you learn about average house prices relative to number of rooms and number of bathrooms, for the different property types and suburbs? (Remove the one really high priced property first, because it affects what we can learn about the rest of the data.)
- For properties with three bathrooms, except for houses in Brighton, prices seem to decrease as rooms increase!
- There are not many townhouses in South Yarra.
- For Brighton houses there is generally an increasing relationship between rooms and price as number of bathrooms increases.
- For both suburbs, generally units with 2 bathrooms are more highly priced, relative to rooms.
- If we throw all the neighbourhoods in together to analyse price and property characteristics, what pitfall might we encounter?
Simpson’s paradox. Suburb is an important factor in property price. The relationship between price and the other characteristics is likely to differ by suburb, and this information will be lost if the suburbs are pooled.
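To make the pitfall concrete, here is a minimal sketch of how pooling can reverse a trend. The two suburbs and all the numbers are invented for illustration; they are not taken from the Melbourne data. Within each toy suburb, price rises with rooms, but the expensive suburb has smaller properties, so the pooled fit slopes the other way.

```python
# Toy illustration of Simpson's paradox; all numbers are invented.

def slope(xs, ys):
    """Least-squares slope of y regressed on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def fit(rows):
    """Slope of price on rooms for a list of (rooms, price) pairs."""
    rooms = [r for r, _ in rows]
    price = [p for _, p in rows]
    return slope(rooms, price)

# (rooms, price in $1000s) for two hypothetical suburbs
suburb_a = [(1, 900), (2, 1000), (3, 1100)]  # expensive suburb, small properties
suburb_b = [(4, 500), (5, 600), (6, 700)]    # cheaper suburb, larger properties

slope_a = fit(suburb_a)
slope_b = fit(suburb_b)
slope_pooled = fit(suburb_a + suburb_b)

print(f"suburb A slope: {slope_a:.0f}")
print(f"suburb B slope: {slope_b:.0f}")
print(f"pooled slope:   {slope_pooled:.0f}")
```

Both within-suburb slopes are positive, while the pooled slope is negative. This is the reason for faceting or colouring by suburb before interpreting the relationship between price and rooms.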
Exercise 2: Olive oils
Following on from the olive oils example from class, we will explore the oils from the south here.
- Grab a copy of the data, and subset to contain just the samples from region = south (1), and also drop eicosenoic acid, because there is nothing useful about this variable for the southern oils.
- Looking only at areas 1-3, that is, excluding Sicily:
- Make an interactive parallel coordinate plot of the fatty acids (except eicosenoic), where the lines are coloured by area. (Code is provided; it is a bit tricky, but worth it!)
- Look at the data in a tour.
- Describe what you learn about differences between the three areas, whether these are separated. Are some variables more useful for distinguishing the three areas? Are there any outliers?

The three areas are quite different on a combination of palmitoleic, oleic, palmitic, and linoleic acids. There are some possible outliers, which can be found by selecting individual lines and noticing those that follow a different trend from the other lines.

The three areas are quite distinct. We could distinguish the growing area of the olive oils by examining the fatty acid composition.
- Re-do b. with Sicily. Explain what you learn about Sicily relative to the other areas.
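The data preparation steps in this exercise (keep region 1, drop eicosenoic, then restrict to areas 1-3) can be sketched as below. This is only an illustration using the standard library: the rows are invented, the column names are assumed, and coding Sicily as area 4 is an assumption inferred from "areas (1-3), that is not Sicily".

```python
import csv
import io

# Invented rows for illustration only. Column names and the Sicily
# code (area 4) are assumptions, not taken from the real data file.
raw = """region,area,palmitic,oleic,eicosenoic
1,1,1075,7823,29
1,2,1300,7200,30
1,4,1200,7400,25
2,5,1100,7300,2
3,7,1050,7900,1
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Keep the southern oils (region 1) and drop the eicosenoic column.
south = [
    {key: value for key, value in row.items() if key != "eicosenoic"}
    for row in rows
    if row["region"] == "1"
]

# Areas 1-3 only, i.e. without Sicily (assumed to be area 4).
south_no_sicily = [row for row in south if row["area"] in {"1", "2", "3"}]

print(len(south), len(south_no_sicily))
```

For the final step, the same plots are made from `south` rather than `south_no_sicily`, so that the Sicilian oils are included.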
